Abstract



Background: Chromatin immunoprecipitation combined with DNA microarrays (ChIP-chip) is an assay for
DNA-protein binding or post-translational chromatin/histone modifications. As with all high-throughput
technologies, it requires thorough bioinformatic processing of the data, for which there is no standard yet. The
primary goal is the reliable identification and localization of genomic regions that bind a specific protein. The 
second step comprises comparison of binding profiles of functionally related proteins, or of binding profiles of the 
same protein in different genetic backgrounds or environmental conditions. Ultimately, one would like to gain a 
mechanistic understanding of the effects of DNA binding events on gene expression. 
Results: We present Starr, a free, open-source R package that, in combination with the package Ringo,
facilitates the comparative analysis of ChIP-chip data across experiments and across different microarray
platforms. Core features are data import, quality assessment, normalization and visualization of the data, and
the detection of ChIP-enriched genomic regions. The use of common Bioconductor classes ensures
compatibility with other R packages.

Conclusion: Starr is an R package that enables flexible analysis of a wide range of ChIP-chip experiments, in
particular for Affymetrix data. Most importantly, Starr provides methods for integration of complementary 
genomics data, e.g., it enables systematic investigation of the relation between gene expression and DNA 
binding. 
Background

ChIP-chip is a technique for identifying protein-DNA interactions. For this purpose, the chromatin is
immunoprecipitated with an antibody to the protein of interest and the fragmented, protein-bound DNA is
analyzed with tiling arrays [1]. Before the results can be interpreted, bioinformatic methods must be
applied to assess the quality of the experiments and to preprocess the data.

Here we present the open-source software package Starr, which is available as part of the open-source
Bioconductor project [2]. It is an extension package for the programming language and statistical
environment R [3]. Starr facilitates the analysis of ChIP-chip data; in particular, it supports experiments
that have been performed on the Affymetrix™ platform. Its functionality includes data acquisition, 
quality assessment and data visualization. Starr provides new functions for high level data analysis, e.g., 
association of ChIP signals with annotated features, gene filtering, and the combined analysis of the ChIP 
signals and other data like gene expression measurements. It uses the standard data structures for 
microarray analyses in Bioconductor, building on and fully exploiting the package Ringo [4]. The latter 
implements algorithms for smoothing and peak-finding, as well as low level analysis functions for 
microarray platforms such as Nimblegen and Agilent. 

Results and discussion 

We demonstrate the utility of Starr by applying it to a yeast RNA-Polymerase II (PolII for short) ChIP 
experiment. We discuss the question whether constitutive mRNA expression is mainly determined by the 
PolII recruitment rate to the promoter. 

Data acquisition, quality assessment and normalization 

We facilitated data import as much as possible, since in our experience, this is a major obstacle for the 
widespread use of R packages in the field of ChIP-chip analysis. The import of data from the microarray
manufacturers Nimblegen and Agilent has already been implemented in Ringo; the common array
platform Affymetrix is covered by Starr. Two kinds of files must be provided to Starr: the
.bpmap file, which contains the mapping of the reporter sequences to their physical positions on the array, and
the .cel files, which contain the actual measurement values. All data, regardless of platform, are
stored in the common Bioconductor object ExpressionSet, which makes them accessible to a number of
algorithms operating on that data structure. An R script reproducing all results of this paper,
together with the data stored as RData objects, can be found in the supplements. ChIP-chip data of yeast
PolII binding were published by Venters and Pugh in 2009 [5] and are available from ArrayExpress under the
accession number E-MEXP-1676. The gene expression data used here is available under accession number 
E-MEXP-2123. Transcription start and termination sites were obtained from David et al. [6]. 
The obligatory second step in the analysis protocol is quality control. The complex experimental 
procedures of a ChIP-chip assay make errors almost inevitable. A particular issue with Affymetrix oligo arrays is
the bias caused by the GC-content of the oligomer probes [7]. Starr displays the average expression of
probes as a function of their GC-content, and it calculates a position-specific bias of every nucleotide in 
each of the 25 positions within the probe (see Figure 1). Moreover, Starr provides many other quality 
control plots like an in silico reconstruction of the physical array image to identify flawed regions on the 
array, or pairwise MA-plots, boxplots and heat-scatter plots to visualize pairwise dependencies within the 
dataset. 
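The two sequence-bias diagnostics described above can be illustrated with a small stand-alone sketch. It is written in plain Python rather than R and does not use Starr's API; the function names and the toy probes are hypothetical, and the per-group median is chosen here only because Figure 1 reports medians. Real Affymetrix probes are 25 nucleotides long, whereas the toy data use 4-mers for readability.

```python
from collections import defaultdict
from statistics import mean, median


def gc_count(seq):
    """Number of G or C bases in a probe sequence."""
    return sum(base in "GC" for base in seq.upper())


def median_intensity_by_gc(probes):
    """Map GC count -> median intensity of all probes with that GC count.

    `probes` is an iterable of (sequence, intensity) pairs.
    """
    groups = defaultdict(list)
    for seq, intensity in probes:
        groups[gc_count(seq)].append(intensity)
    return {gc: median(vals) for gc, vals in sorted(groups.items())}


def positional_mean_intensity(probes, probe_length=25):
    """Map (0-based position, nucleotide) -> mean intensity of all probes
    that carry this nucleotide at this position."""
    groups = defaultdict(list)
    for seq, intensity in probes:
        for pos, base in enumerate(seq.upper()[:probe_length]):
            groups[(pos, base)].append(intensity)
    return {key: mean(vals) for key, vals in groups.items()}


# Hypothetical toy data: (probe sequence, raw intensity)
toy = [("AATT", 1.0), ("AGTT", 2.0), ("GGCT", 4.0),
       ("GCGC", 8.0), ("ACTT", 3.0)]

gc_bias = median_intensity_by_gc(toy)
pos_bias = positional_mean_intensity(toy, probe_length=4)
```

In this toy example the median intensity rises with the GC count, which is the kind of trend the GC-bias plot is meant to expose; `pos_bias` holds one value per nucleotide and position, corresponding to the letters drawn in a position-bias plot.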

For the purpose of bias removal (normalization), Starr interfaces the packages limma and rMAT, the latter
of which implements the MAT algorithm [8]. It also provides its own normalization methods, such as the
median-rank-percentile normalization, which was originally proposed by Buck and Lieb in 2004 [9]. 
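As a rough illustration of the idea behind a rank-based percentile score, the following sketch converts the values of each replicate array to within-array percentile ranks and then takes the median rank of each probe across replicates. This is an assumption about the general approach, not a reproduction of Starr's implementation or of the exact procedure in [9]; the function names, the tie handling and the toy values are hypothetical.

```python
from statistics import median


def percentile_ranks(values):
    """Convert values to percentile ranks in (0, 1].

    Tied values share the average of their ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the end of the current group of tied values.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / n
        i = j + 1
    return ranks


def median_rank_percentile(replicates):
    """Median across replicates of each probe's within-array percentile rank.

    `replicates` is a list of equally long lists, one per array.
    """
    ranked = [percentile_ranks(arr) for arr in replicates]
    return [median(col) for col in zip(*ranked)]


# Hypothetical toy data: three replicate arrays on different scales
rep1 = [0.2, 1.5, 0.9, 3.0]
rep2 = [10.0, 40.0, 50.0, 90.0]
rep3 = [5.0, 7.0, 6.0, 8.0]

scores = median_rank_percentile([rep1, rep2, rep3])
```

Because only ranks enter the score, the three replicates contribute equally even though their raw intensities lie on very different scales.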

Visualization and high-level analysis 

Starr provides functions for the visualization of a set of "profiles" (e.g. time series, signal levels along 
genomic positions). Figure 2 shows the ChIP profile of PolII around the transcription start site of genes
whose mRNA expression according to [10] lies in the bottom 20% or the top 10% of all yeast genes (the
cutoffs were chosen such that within both groups, the number of genes having an annotated transcription
start site was roughly the same). The common way of looking at intensity profiles is to calculate and
plot the mean intensity at each available position along the region of interest. Such an illustration, however,
may hide more than it reveals, since it fails to capture the variability at each position. It is desirable to
display this variability in order to assess whether a seemingly obvious alteration in DNA binding is 
significant or not. Accordingly, our profileplot function relates to the conventional mean value plot as a
box plot relates to an individual sample mean: Let the profiles be given as the rows of a samples x
positions matrix that contains the respective signal of a sample at a given position. Instead of plotting a
line for each profile (i.e. each row of the matrix), the q-quantiles for each position (i.e. each column of the matrix)
are calculated, where q runs through a set of representative quantiles. Then for each q, the profile line of
the q-quantiles is plotted. Color coding of the quantile profiles aids the interpretation of the plot: There is
a color gradient from the median profile to the 0 (=min) and 1 (=max) quantiles. Another useful
high-level plot in Starr is the correlationPlot, which displays the correlation of a gene-related binding signal
to its corresponding gene expression. Figure 3 shows a plot in which the mean PolII occupancy in various 
transcript regions of 2526 genes is compared to the corresponding mRNA expression. Each region is 
defined by its begin and end position relative to the transcription start site (start sites are taken from [6]). 
The regions are plotted in the lower panel of Figure 3. For each region, the correlation between the vector 
of mean occupancies and the vector of gene expression values is calculated and shown in the upper panel. 
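The computations behind both plots in this section can be sketched in a few lines. The code below is a stand-alone Python illustration, not Starr's R implementation: the function names, the linear quantile interpolation, the use of the Pearson coefficient and the toy matrix are all assumptions made for the example. Rows of the matrix are genes (profiles), columns are positions relative to the transcription start site.

```python
from math import sqrt


def quantile(sorted_vals, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of a sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def quantile_profiles(matrix, qs=(0.05, 0.25, 0.5, 0.75, 0.95)):
    """For a samples x positions matrix, return {q: profile of q-quantiles}.

    Each profile has one value per position (matrix column).
    """
    columns = [sorted(col) for col in zip(*matrix)]
    return {q: [quantile(col, q) for col in columns] for q in qs}


def pearson(x, y):
    """Pearson correlation; assumes both vectors have non-zero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def region_correlations(matrix, positions, regions, expression):
    """Correlate the mean signal within each region with gene expression.

    `regions` maps a name to (begin, end), both relative to the TSS and
    inclusive; `positions` gives the coordinate of each matrix column.
    """
    result = {}
    for name, (begin, end) in regions.items():
        idx = [i for i, p in enumerate(positions) if begin <= p <= end]
        if not idx:
            raise ValueError(f"no positions inside region {name!r}")
        occupancy = [sum(row[i] for i in idx) / len(idx) for row in matrix]
        result[name] = pearson(occupancy, expression)
    return result


# Hypothetical toy data: 5 genes x 4 positions relative to the TSS
positions = [-100, 0, 100, 200]
matrix = [
    [1.0, 2.0, 1.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
    [3.0, 1.0, 3.0, 3.0],
    [4.0, 3.0, 4.0, 5.0],
    [5.0, 2.0, 6.0, 7.0],
]
expression = [1.0, 2.0, 3.0, 4.0, 5.0]

profiles = quantile_profiles(matrix, qs=(0.0, 0.5, 1.0))
corr = region_correlations(
    matrix, positions,
    {"upstream": (-100, -100), "tss": (0, 0), "gene_body": (100, 200)},
    expression,
)
```

Plotting one line per entry of `profiles` gives the quantile view of the data, and `corr` holds one correlation value per region, analogous to the upper panel of the correlation plot. In the toy data the signal directly at the TSS correlates only weakly with expression, while the gene-body region correlates strongly.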

Results interpretation 

Figures 2 and 3 supply ambiguous evidence for the role of PolII recruitment in basal transcription: The profile
plots suggest that a high PolII occupancy at the initiation region of a gene is a necessary prerequisite for a
high mRNA expression level. In contrast, the correlation plot reveals that PolII occupancy at the
transcription start is not a good predictor of mRNA expression, but the mean occupancy of PolII in the
elongation phase (region 4 in Figure 3) is. Nevertheless, a more detailed analysis of particular gene groups, and
a comparison of PolII profiles under different environmental conditions might yield valuable new insights. 

Conclusion 

Starr is a Bioconductor package for the analysis of ChIP-chip experiments, in particular of Affymetrix
tiling arrays. It exploits the full functionality of Ringo for the analysis of Affymetrix tiling arrays, which
includes functions for peak finding, smoothing and plotting genomic regions. Starr adds new analysis and
visualization methods, which can also be applied to two-color technologies. It utilizes standard 
Bioconductor object classes and can thus easily interface other Bioconductor packages. All functions and 
methods in the package are well documented in help pages and in a vignette, which illustrates a workflow 
by means of some example data. Support is provided by the Bioconductor mailing list and the package
maintainer. 

Altogether, Starr in conjunction with Ringo constitutes a powerful and comprehensive tool for the analysis
of tiling arrays across established one- and two-color technologies like Affymetrix, Agilent and Nimblegen. 

Availability and requirements 

The R package Starr is available from the Bioconductor web site at http://www.bioconductor.org and runs
on Linux, Mac OS and MS-Windows. It requires an installed version of R (version >= 2.10.0), which is
freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org, as well as
other Bioconductor packages, namely Ringo, affy, affxparser, rMAT and vsn, plus the CRAN packages
pspline and MASS. The easiest way to obtain the most recent version of the software, with all its
dependencies, is to follow the instructions at http://www.bioconductor.org/download. Starr is distributed
under the terms of the Artistic License 2.0. 
Figures

Figure 1 - Hybridization bias 

Sequence-specific dependency of raw reporter intensities. (A) Boxplots of probe intensity distributions. 
Probes are grouped according to the G/C content in their sequence. The median intensity increases with
rising G/C content. (B) Position-dependent mean probe intensity. Each letter corresponds to the mean
intensity of all probes that contain the corresponding nucleotide in the respective position. 

Figure 2 - PolII around the transcription start site

Profiles of PolII occupancy of genes with low (bottom 20%) and high (top 10%) transcription rates (cluster 1
and cluster 2, respectively). The upper graphs show the mean occupancy calculated at each position around the
transcription start site. The lower plots illustrate the variance in the two clusters. The black line indicates
the median profile of all features. The color gradient corresponds to quantiles (from 0.05 to 0.95), and the 
first and third quartiles are shown as grey lines. The light grey lines in the background show the profiles of 
individual "outlier" features. 

Figure 3 - Correlation of PolII occupancy to gene expression 

Starr enables the systematic investigation of gene expression related to DNA binding. The figure shows the
correlation of the mean PolII occupancy within different regions along the transcript to gene expression. 
The lower panel shows the regions of interest relative to the transcription start site (TSS) and the 
transcription termination site (TTS). The upper panel shows the correlation of PolII occupancy to the 
gene expression of the corresponding regions. 